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Abstract — A wavelet scattering network computes a translation invari- 
ant image representation, which is stable to deformations and preserves 
high frequency information for classification. It cascades wavelet trans- 
form convolutions with non-linear modulus and averaging operators. The 
first network layer outputs SIFT-type descriptors whereas the next layers 
provide complementary invariant information which improves classifica- 
tion. The mathematical analysis of wavelet scattering networks explain 
important properties of deep convolution networks for classification. 

A scattering representation of stationary processes incorporates 
higher order moments and can thus discriminate textures having same 
Fourier power spectrum. State of the art classification results are ob- 
tained for handwritten digits and texture discrimination, with a Gaussian 
kernel SVM and a generative PGA classifier. 



1 Introduction 

A major difficulty of image classification comes from the 
considerable variability within image classes and the in- 
ability of Euclidean distances to measure image similari- 
ties. Part of this variability is due to rigid translations, ro- 
tations or scaling. This variability is often uninformative 
for classification and should thus be eliminated. In the 
framework of kernel classifiers [31 J, metrics are defined 
as a Euclidean distance applied on a representation ^{x) 
of signals x. The operator $ must therefore be invariant 
to these rigid transformations. 

Non-rigid deformations also induce important vari- 
ability within object classes [31/ ESI/ ED- For instance, 
in handwritten digit recognition, one must take into ac- 
count digit deformations due to different writing styles. 
However, a full deformation invariance would reduce 
discrimination since a digit can be deformed into a 
different digit, for example a one into a seven. The rep- 
resentation must therefore not be deformation invariant 
but continuous to deformations, to handle small defor- 
mations with a kernel classifier. A small deformation 
of an image x into x' should correspond to a small 
Euclidean distance ||<l>(x) - <l>(x')|| in the representation 
space, as further explained in Section |2l 

Translation invariant representations can be con- 
structed with registration algorithms [32J or with the 
Fourier transform modulus. However, Section 12.11 ex- 
plains why these invariants are not stable to deforma- 
tions and hence not adapted to image classification. 
Trying to avoid Fourier transform instabilities suggests 
replacing sinusoidal waves by localized waveforms such 



as wavelets. However, wavelet transforms are not invari- 
ant to translations. Building invariant representations 
from wavelet coefficients requires introducing non-linear 
operators, which leads to a convolution network archi- 
tecture. 

Deep convolution networks have the ability to build 
large-scale invariants which are stable to deformations 
[18J. They have been applied to a wide range of image 
classification tasks. Despite the remarkable successes 
of this neural network architecture, the properties and 
optimal configurations of these networks are not well 
understood because of cascaded non-linearities. Why use 
multiple layers ? How many layers ? How to optimize 
filters and pooling non-linearities ? How many internal 
and output neurons ? These questions are mostly an- 
swered through numerical experimentations that require 
significant expertise. 

Deformation stability is obtained with localized 
wavelet filters which separate the image variations at 
multiple scales and orientations [22] . Computing a non- 
zero translation invariant representation from wavelet 
coefficients requires introducing a non-linearity, which is 
chosen to be a modulus to optimize stability [6J. Wavelet 
scattering networks, introduced in Il23l , [|22|, build trans- 
lation invariant representations with average poolings 
of wavelet modulus coefficients. The output of the first 
network layer is similar to SIFT ||2TI or Daisy [33] type 
descriptors. However, this limited set of locally invariant 
coefficients is not sufficiently informative to discriminate 
complex structures over large-size domains. The infor- 
mation lost by the averaging is recovered by computing 
a next layer of invariant coefficients, with the same 
wavelet convolutions and average modulus poolings. A 
wavelet scattering is thus a deep convolution network 
which cascades wavelet transforms and modulus opera- 
tors. The mathematical properties of scattering operators 
[22] explain how these deep network coefficients relate to 
image sparsity and geometry. The network architecture 
is optimized in Section |3l to retain important information 
while avoiding useless computations. 

A scattering representation of stationary processes is 
introduced for texture discrimination. As opposed to 
the Fourier power spectrum, it provides information 
on higher order moments and can thus discriminate 
non-Gaussian textures having the same power spec- 



trum. Classification applications are studied in Section 
14.11 Scattering classification properties are demonstrated 
with a Gaussian kernel SVM and a generative classi- 
fier, which selects affine space models computed with 
a PCA. State-of-the-art results are obtained for hand- 
written digit recognition on MNIST and USPS databes, 
and for texture discrimination. Software is available at 
www.cmap.polytechnique.fr/scattering. 

2 Towards a Convolution Network 

Section l2Jl formalizes the deformation stability condition 
as a Lipschitz continuity property, and explains why 
high Fourier frequencies are source of unstabilites. Sec- 
tion |2]2] introduces a wavelet-based scattering transform, 
which is translation invariant and stable to deformations, 
and section IZ3l describes its convolutional network archi- 
tecture. 

2.1 Fourier and Registration Invariants 

A representation ^{x) is invariant to global translations 

Lcx{u) = x{u — c) by c = (ci, C2) G M? if 



^{L,x) = ^x) . 



(1) 



A canonical invariant iFTSl , 1321 ^{x) = x{u — a{x)) 
registers x with an anchor point a(x), which is translated 
when X is translated: a{Lcx) = a{x) + c. It is therefore 
invariant: ^{Lcx) = ^{x). For example, the anchor point 
may be a filtered maxima a{x) = argmax^ \x ^ h{u)\, for 
some filter h{u). 

The Fourier transform modulus is another example 
of translation invariant representation. Let x{uj) be the 
Fourier transform of x{u). Since Lcx{uj) = e~*^-^f(cj), it 
results that \Lcx\ = \x\ does not depend upon c. 

To obtain appropriate similarity measurements be- 
tween images which have undergone non-rigid trans- 
formations, the representation must also be stable to 
small deformations. A small deformation can be written 
Lrx{u) = x{u — r{u)) where t{u) depends upon u 
and thus deforms the image. The deformation gradient 
tensor Vt{u) is a matrix whose norm \Vt{u)\ measures 
the deformation amplitude at u. A small deformation 
is an invertible transformation if \Vt{u)\ < 1 ||2|, 1341 . 
Stability to deformations is expressed as a Lipschitz 
continuity condition relative to this deformation metric: 



\\^{Lrx) - ^{x)\\ < C \\x\\ sup \\/r{u) 



(2) 



where ||x|p = J \x{u)\'^ du. This property implies global 
translation invariance, because if t{u) = c then Vt{u) = 
0, but it is much stronger. 

A Fourier modulus is translation invariant but un- 
stable with respect to deformations at high frequencies. 
Indeed, | \x{uj)\ — \Lrx{uj)\ \ can be arbitrarily large at a 
high frequency a;, even for small deformations and in 
particular small dilations. As a result, ^{x) = \x\ does 
not satisfy the deformation continuity condition (O Il22ll . 
A Fourier modulus also loses too much information. 



For example, a Dirac S{u) and a linear chirp e*^ are 
totally different signals having Fourier transforms whose 
moduli are equal and constant. Very different signals 
may not be discriminated from their Fourier modulus. 

A registration invariant ^(x) = x{u — a{x)) carries 
more information than a Fourier modulus, and charac- 
terizes X up to a global absolute position information 
p2|l . However, it has the same high-frequency instability 
as a Fourier transform. Indeed, for any choice of anchor 
point a(x), applying the Plancherel formula proves that 

\\x{u-a{x))-x\u-a{x'))\\ > {27r)-^ \\\x{lj)\ - \x\uj)\\\ . 

(3) 
If x' = LrX, the Fourier transform instability at high 
frequencies implies that $(x) = x{u — a{x)) is also 
unstable with respect to deformations. 

2.2 Scattering Wavelets 

A wavelet is a localized waveform and is thus stable 
to deformation, as opposed to the Fourier sinusoidal 
waves. A wavelet transform computes convolutions with 
wavelets. It is thus translation covariant, not invariant. 
A scattering transform computes non-linear invariants 
with modulus and averaging pooling functions. 

Two-dimensional directional wavelets are obtained by 
scaling and rotating a single band-pass filter ijj. Let G 
be a discrete, finite rotation group in R^. Multiscale 
directional wavelet filters are defined for any j G Z and 
rotation r e G hy 



V^2.vH = 2^^^{2^r-\) 



(4) 



If the Fourier transform i^{oj) is centered at a frequency 
T] then ilj2Jr{^) = V^(2~-^r~^cj) has a support centered at 
2^rr], with a bandwidth proportional to 2^. To simplify 
notations, we denote A = 2^r G A = G x Z, and |A| =2^. 

A wavelet transform filters x using a family of 
wavelets: {x^ V^a(^)}a- It is computed with a filter bank 
of dilated and rotated wavelets having no orthogonality 
property. As further explained in Section 13. li it is stable 
and invertible if the rotated and scaled wavelet filters 
cover the whole frequency plane. On discrete images, 
to avoid aliasing, we only capture frequencies in the 
circle \uj\ < tt inscribed in the image frequency square. 
However, most digital natural images and textures have 
negligible energy outside this frequency circle. 

Let u.u' and \u\ denote the inner product and norm in 
R^. A Morlet wavelet t/j is an example of wavelet given 
by 

where C2 is adjusted so that / iIj{u) du = 0. Figure [l] 
shows the Morlet wavelet with a = 0.85 and ^ = 37r/4, 
used in all classification experiments. 

A wavelet transform commutes with translations, and 
is therefore not translation invariant. To build a transla- 
tion invariant representation, it is necessary to introduce 
a non-linearity. If i? is a linear or non-linear operator 
which commutes with translations, R{Lcx) = LcRx, 




\ 




(a) (b) (c) 

Fig. 1 . Complex Morlet wavelet, (a): Real part of iIj. (b): Imaginary part of tp. (c): Fourier modulus \ip\. 



then the integral / Rx{u) du is translation invariant. 
Applying this to Rx = x i^ ipx gives a trivial invariant 
j X -k \ljx{u)du = for all x because j\ljx{u)du = 0. 
If Rx = M{x i^ ipx) but M is linear and commutes 
with translations then the integral still vanishes, which 
imposes choosing a non-linear M. Taking advantage of 
the wavelet transform stability to deformations, to obtain 
integrals which are also stable to deformations we also 
impose that M commutes with deformations 

Vr(ii) , MLr = LrM . 

By adding a weak differentiability condition, one can 
prove [6] that M must necessarily be a pointwise op- 
erator, which means that Mx(u) only depends on the 
value x{u). If we also impose an L^(R^) stability 

V(x,y) G L2(R^)2 , \\Mx\\ = \\x\\ and \\Mx-My\\ < \\x-y\ 

then one can verify [6J that necessarily Mx = e*" |x|, 
and we set a = 0. The resulting translation invariant 
coefficients are therefore L^(R^) norms: ||x^V^a||i = / |^^ 
iljx{u)\du. 

The L^(R^) norms {||x ^ ?/^a||i}a form a crude sig- 
nal representation, which measures the sparsity of the 
wavelet coefficients. For appropriate wavelets, one can 
prove [36J that x can be reconstructed from {|x^V^a(^)|}a/ 
up to a multiplicative constant. The information loss 
thus comes from the integration of |x ^ ?/^a(^)|/ which 
removes all non-zero frequency components. These non- 
zero frequencies can be recovered by calculating the 
wavelet coefficients {|x^V^Ai |^V^A2(^)}a2 of k^V^Ai |- Their 
L^(R^) norms define a much larger family of invariants, 
for all Ai and A2: 



|x^?/;aJ^V^A2||] 



\\xi^iljx^{u)\i^iljX2\du. 



More translation invariant coefficients can be com- 
puted by further iterating on the wavelet transform and 
modulus operators. Let U[X]x = \xi^ipx\. Any sequence 
p = (Ai, A2, ..., Arn) defines a path, i.e, the ordered product 
of non-linear and non-commuting operators 

U[p]x = U[Xm] ... U[X2] U[Xi]x = I IIx^V^aJ^V^a^I ... \^^xj , 

with U[9]x = X. A scattering transform along the path p 
is defined as an integral, normalized by the response of 



a Dirac: 



du 



Each scattering coefficient Sx{p) is invariant to a trans- 
lation of X. We shall see that this transform has many 
similarities with the Fourier transform modulus, which 
is also translation invariant. However, a scattering is 
Lipschitz continuous to deformations as opposed to the 
Fourier transform modulus. 

For classification, it is often better to compute localized 
descriptors which are invariant to translations smaller 
than a predefined scale 2^, while keeping the spatial 
variability at scales larger than 2^. This is obtained by 
localizing the scattering integral with a scaled spatial 
window (j)2j{u) = 2~'^'^ (j){2~'^u). It defines a windowed 
scattering transform in the neighborhood of u: 



Sj[p]x{u) = U[p]xi^4>2j{u) 



and hence 



/ U[p]x{v)^2'J 



{u — v) dv 



Sj[p]x{u) = I \\xi.^|Jx,\i^^|^X2\^'^\^i^xJ^^2J{u) , 

with 5'j[0]x = X i^ (j)2J- For each path p, Sj[p]x{u) is 
a function of the window position u, which can be 
subsampled at intervals proportional to the window size 
2'^ . The averaging by ^2^ implies that Sj[p]x{u) is nearly 
invariant to translations Lcx{u) = x{u — c) if \c\ <C 2^. 
Section 13.11 proves that it is also stable relatively to 
deformations. 



2.3 Scattering Convolution Network 

If p is a path of length m then Sj[p]x{u) is called 
scattering coefficient of order m at the scale 2^. It is 
computed at the layer m of a convolution network which 
is specified. For large scale invariants, several layers are 
necessary to avoid losing crucial information. 

For appropriate wavelets, first order coefficients 
Sj[Xi]x are equivalent to SIFT coefficients [|2T|. Indeed, 
SIFT computes the local sum of image gradient ampli- 
tudes among image gradients having nearly the same 
direction, in a histogram having 8 different direction 
bins. The DAISY approximation [33J shows that these 
coefficients are well approximated by Sj[2^r]x = |x ^ 
V^2Jr| ^ 4^2 J {^) where V^2Jr is the partial derivative of a 



Gaussian computed at the finest image scale 2^, for 8 
different rotations r. The averaging filter (/)2j is a scaled 
Gaussian. 

Partial derivative wavelets are well adapted to detect 
edges or sharp transitions but do not have enough fre- 
quency and directional resolution to discriminate com- 
plex directional structures. For texture analysis, many 
researchers fT9l, [30J, [28J have been using averaged 
wavelet coefficient amplitudes \x i^ipxl ^ <l^j{u)r but cal- 
culated with a complex wavelet ip having a better fre- 
quency and directional resolution. 

A scattering transform computes higher-order coeffi- 
cients by further iterating on wavelet transforms and 
modulus operators. At a maximum scale 2^, wavelet 
coefficients are computed at frequencies 2^ > 2~^, and 
lower frequencies are filtered by 02^ (^) = 2~'^'^ (j){2~'^u). 
Since images are real-valued signals, it is sufficient to 
consider "positive'' rotations r G G+ with angles in [0, tt): 



W^ 



^JX{U) = [xi^(l)2j{u) , X^V^a(^)| 



AGAj 



(5) 



with Aj = {A = 2h : r e G"',] > -J}. For a 
Morlet wavelet ifj, the averaging filter ^ is chosen to be 
a Gaussian. Let us emphasize that 2^ is a spatial scale 
variable whereas A = 2^r is assimilated to a frequency 
variable. 

A wavelet modulus propagator keeps the low- 
frequency averaging and computes the modulus of com- 
plex wavelet coefficients: 



Ujx{u) = [x'k(j)2j{u) , \xi<^lJx{u)\\ 



xeAj 



Let Ay be the set of all paths p = (Ai, ..., A^) of length 

m. We denote U[AJ]x = {U[p]x}peAj and Sj[AJ]x = 
{Sj[p]x}peAj. Since 

UjU[p]x= ^U[p]xir(l)2J , \U[p]xirtljx\j , 
and Sj[p]x = U[p\x-k (j)2J , it results that 

UjU[K^]x = {UjU[p]x}peAj = {Sj[A7]x, t/[A7+V} • 

(7) 
This implies that Sj[p]x can be computed along paths 
of length m < mmax by first calculating Ujx = 
{5'j[0]x, U[A^j]x} and iteratively applying Uj to each 
U[Ay']x for increasing m < mmax- This algorithm is 
illustrated in Figure |2l 

A scattering transform thus appears to be a deep 
convolution network [18 J, with some particularities. As 
opposed to most convolution networks, a scattering net- 
work outputs coefficients Sj[p]x at all layers m < mmax/ 
and not just at the last layer mmax 118J. The next section 
proves that the energy of the deepest layer converges 
quickly to zero as mmax increases. 

A second distinction is that filters are not learned from 
data but are predefined wavelets. Wavelets are stable 
with respect to deformations and provide sparse image 
representations. Stability to deformations is a strong 



condition which imposes a separation of the different 
image scales [22l, hence the use of wavelets. 

The modulus operator which recombines real and 
imaginary parts can be interpreted as a pooling function 
in the context of convolution networks. The averaging 
by (j)2J at the output is also a pooling operator which 
aggregates coefficients to build an invariant. It has been 
argued [7] that an average pooling loses information, 
which has motivated the use of other operators such 
as hierarchical maxima [8J. The high frequencies lost 
by the averaging are recovered as wavelet coefficients 
in the next layers, which explains the importance of 
using a multilayer network structure. As a result, it 
only loses the phase of these wavelet coefficients. This 
phase may however be recovered from the modulus 
thanks to the wavelet transform redundancy. It has been 
proved [36] that the wavelet-modulus operator Ujx = 
{x ^ ^2^5 1^ ^ i^x\}xeAj is invertible with a continuous 
inverse. It means that x and hence the complex phase 
of each x -k ijjx can be reconstructed. Although Uj is 
invertible, the scattering transform is not exactly invert- 
ible because of instabilities. Indeed, applying Uj in 
for m < mmax computes all Sj[A7]x for m < mmax 
but also the last layer of internal network coefficients 
^j^m^ax+i]^ The next section proves that [/[A^^^^^+V 
can be neglected because its energy converges to zero as 
^max increases. However, this introduces a small error 
which accumulates when iterating on Uj^. 

Scattering coefficients can be displayed in the fre- 
quency plane. Let {^[p]}pGA^ be a partition of M?. To 



(6) each frequency cj G M we associate the path p{ij 



such that UJ G 0.[p\. We display S j[p{uo)\x{u) , which 
is a piece wise constant function of cj G M^, for each 
position u and each m = 1, 2. For m = 1, each Q.[2^^ri] is 
chosen to be a quadrant rotated by ri, to approximate the 
frequency support of V^2Jiri/ whose size is proportional 
to ||V^2Jiri IP ^rid hence to 2^^. This defines a partition of 
a dyadic annulus illustrated in Figure (Sfa). For m = 2, 
r^ [2-^^ r 1,2-^2^2] is obtained by subdividing Vt[2^^ri], as 
illustrated in Figure Ob). Each l][2^^ri] is subdivided 
along the radial axis into quadrants indexed by J2 • Each 
of these quadrants are themselves subdivided along the 
angular variable into rotated quadrants VL[2^^ri,2^'^r2] 
having a surface proportional to |||V^2Jiril ^ V^2J2r2lP- 

Figure IH shows the Fourier transform of two images, 
and the amplitude of their scattering coefficients of 
orders m = 1 and m = 2, at a maximum scale 2^ equal to 
the image size. A scattering coefficient over a quadrant 
l^[2^^ri] gives an approximation of the Fourier transform 
energy over the support of V^2Jiri- Although the top 
and bottom images are very different, they have same 
order m = 1 scattering coefficients. Here, first-order 
coefficients are not sufficient to discriminate between 
two very different images. However, coefficients of order 
m = 2 succeed in discriminating between the two 
images. The top image has wavelet coefficients which are 
much more sparse than the bottom image. As a result. 
Section 13.11 shows that second-order scattering coeffi- 




m=0 



m=l 



m=2 



Fig. 2. A scattering propagator Uj applied to x computes each U[Xi]x = I^^V^aJ and outputs Sj[(l)]x = x^(j)2J (black 
arrow). Applying Uj to each [/[Ai]x computes all /7[Ai,A2]x and outputs 5'j[Ai] = /7[Ai]^^2^ (black arrows). Applying 
Uj iteratively to each U[p]x outputs Sj[p]x = U[p]x ^ ^2^ (black arrows) and computes the next path layer. 



Q[2^^ri,23^r2] 




(a) 



(b) 



Fig. 3. For m = 1 and m = 2, a scattering is displayed as piecewise constant functions equal to Sj[p]x{u) over each 
frequency subset Q[p]. (a): For m = 1, each n[2^^ri] is a rotated quadrant of surface proportional to 2^k (b): For m = 2, 
each Q[2^^ri] is subdivided into a partition of subsets Q[2^^ri,2^^r2]. 



cients have a larger amplitude. Higher-order coefficients 
are not displayed because they have a negligible energy 
as explained in Section [3l 



3 Scattering Properties 

A convolution network is highly non-linear, which 
makes it difficult to understand how the coefficient 
values relate to the signal properties. For a scattering 
network. Section 13.1! analyzes the coefficient properties 
and optimizes the network architecture. For texture anal- 
ysis, the scattering transform of stationary processes 
is studied in Section 13.21 The regularity of scattering 
coefficients can be exploited to reduce the size of a 
scattering representation, by using a cosine transform, 
as shown in Section l33l Finally, Section |3^ provides a 
fast computational algorithm. 



3.1 Energy Conservation and Deformation Stability 

A windowed scattering Sj is computed with a cascade 
of wavelet modulus operators Uj, and its properties 
thus depend upon the wavelet transform properties. 
Conditions are given on wavelets to define a scattering 
transform which is contracting and preserves the sig- 
nal norm. This analysis shows that ||^j[p]a;|| decreases 
quickly as the length of p increases, and is non-negligible 
only over a particular subset of frequency-decreasing 
paths. Reducing computations to these paths defines 
a convolution network with much fewer internal and 
output coefficients. 

The norm of a sequence of transformed signals Rx = 
{9n}nen is defined by \\Rx\\^ = Eneo IbniP- If x is real 
and there exists e > such that for all cj G M^ 

-| CX) 

1 - e < |0H|2 + 2 E E m-'rco)\' < 1 . (8) 

j=l reG 

then applying the Plancherel formula proves that Wjx = 






(a) 



(b) 



(c) 



(d) 



Fig. 4. Scattering display of two images having the same first order scattering coefficients, (a) Image x{u). (b) Fourier 
modulus \x{lo)\. (c) Scattering Sjx[p{lo)] for m = 1. (d) Scattering Sjx[p{lo)] for m = 2. 



{x-k<pj , x-k tp\}xeAj satisfies 

{l-e)\\xf<\\Wjxf<\\xf, 



(9) 



of the last network layer converges to zero when m^ 
increases 



with 



\Wjxr = \\x*<Pjf+ExeAj\\^*i^^ 
following we suppose that e < 1 and hence that the 
wavelet transform is a contracting and invertible opera- 
tor, with a stable inverse. If e = then Wj is unitary. The 
Morlet wavelet ip in Figure [l] satisfies ^ with e = 0.25, 
together with (l){u) = Cexp(-|?x|V(2crg)) with ctq = 0.7 
and C adjusted so that ^ (j){u)du = 1. These functions 
are used in all classification applications. Rotated and 
dilated cubic spline wavelets are constructed in [22J to 
satisfy ® with e = 0. 

The modulus is contracting in the sense that | |a| - 16| | < 
\a—h\. Since Uj = {x^^j , |x^V^a|}agAj is obtained with a 
wavelet transform Wj followed by modulus, which are 
both contractive, it is also contractive: 

\\Ujx-Ujy\\<\\x-y\\ . 

If Wj is unitary then Uj also preserves the signal norm 

\\Ujx\\ = \\x\\. 

Let Vj = Um>o^J be the set of all possible paths of 
any length m G N. The norm of Sj[Vj]x = {Sjlplxjp^-pj 
is \\Sj[Vj]x\\^ = Epevj WSjIpM^- Since Sj iteratively 
applies Uj which is contractive, it is also contractive: 

\\Sjx-Sjy\\<\\x-y\\. 

If Wj is unitary, e = in (|9j and for appropriate 
wavelets, it is proved in [22J that 

CX) oo 

WSjxf = J2 \\Sj[A7M' = E E \\Sj\pH' = M' ■ 

(10) 
This result uses the fact that Uj preserves the sig- 
nal norm and that UjU[AJ]x = {Sj[AJ]x, U[Ay^^]x}. 
Proving (p^ is thus equivalent to prove that the energy 



2. In the lim \\U[Ay-^]a 



= lim Yl \\Sj[A7M' = 0. 

(11) 
This result is also important for numerical applications 
because it explains why the network depth can be lim- 
ited with a negligible loss of signal energy. 

The scattering energy conservation also provides a 
relation between the network energy distribution and 
the wavelet transform sparsity For p = (Ai,...,Am), 
we denote p + A = (A, Ai, ..., A^). Applying (p^ to 
/7[A]x = \x -k i\)\\ instead of x, and separating the first 
term for m = yields 

CX) 

l|5j[A]xf +^ ^ ||5j[A + p]x|p = ||x*^,f . (12) 

But 5'j[A]x = \x-ki\)\\ -k (j)2J is a local L^(R^) norm and 
one can prove ES that limj^oo 2'^'^\\Sj[\]x\\^ = ||(/)|p \\xi. 
V^aIIi- The more sparse xiriljx{u) the smaller ||x^?/^a||i and 
(O implies that the total energy 5])m=i T^peAj II^^I^ + 
A]x|p of higher-order scattering terms is then larger. 
Figure |4] shows two images having same first order scat- 
tering coefficients, but the top image is piecewise regular 
and hence has wavelet coefficients which are much more 
sparse than the uniform texture at the bottom. As a result 
the top image has second order scattering coefficients of 
larger amplitude than at the bottom. For typical images, 
as in the CalTechlOl dataset [10], Table [l] shows that the 
scattering energy has an exponential decay as a function 
of the path length m. As proved by (pT) , the energy of 
scattering coefficients converges to as m increases and 
is below 1% for m > 3. 

The energy conservation ([T0)| is proved by showing 
that the scattering energy ||t/[p]x|P propagates towards 



TABLE 1 

This table gives the percentage of scattering energy 

||5'j(A7)x|p/||x|p captured by frequency-decreasing 

paths of length m, as a function of J. These are 

averaged values computed over normalized images with 

J x{u)du = and ||x|| = 1, in the Caltech-101 database. 

The scattering is computed with cubic spline wavelets. 
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m = 


m = 1 


m = 2 


m = 3 


m = 4 


m < 3 


1 


95.1 


4.86 


- 


- 


- 


99.96 


2 


87.56 


11.97 


0.35 


- 


- 


99.89 


3 


76.29 


21.92 
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lower frequencies as the length of p increases. This 
energy is thus ultimately captured by the low-pass filter 
(\)2J which outputs Sj\j)\x = U[p]xir(j)2J . This property re- 
quires that xi^ipx has a lower-frequency envelope |x^?/^a|- 
It is valid if tp{u) = e*^-^ 0{u) where is a low-pass filter. 
To verify this property, we write xi^iljx{u) = e^^^-'^ xx{u) 



with 



xx{u) = {e-'^^-^x{u))^ex{u) 



This signal is filtered by the dilated and rotated low-pass 
filter 9x whose Fourier transform is 9x{^) = 0{X~^uj). So 
\xi^iljx{u)\ = |^a(^)| is the modulus of a regular function 
and is therefore mostly regular. This result is not valid 
if ?/^ is a real because |x ^ V^a| is singular at each zero- 
crossing of Xirlljx{u). 

The modulus appears as a non-linear "demodulator" 
which projects wavelet coefficients to lower frequencies. 
If A = 2^r then \x ^ V^a(^)| ^ ^X' for A' = 2^'r' is 
non-negligible only if ipx' is located at low frequencies 
and hence if 2^ < 2^. Iterating on wavelet modulus 
operators thus propagates the scattering energy along 
frequency-decreasing paths p = (2-^^ri, ...,2-^"^rm) where 
2Jk < 2^^-i, for 1 < k < m. Scattering coefficients 
along other paths have a negligible energy. Over the Cal- 
TechlOl images database. Table [l] shows that over 99% 
of the scattering energy is concentrated along frequency- 
decreasing paths of length m < 3. Numerically, it is 
therefore sufficient to compute the scattering transform 
along this subset of frequency-decreasing paths. It de- 
fines a much smaller convolution network. Section 13.41 
shows that the resulting coefficients are computed with 
0{N log N) operations. 

For classification applications, one of the most impor- 
tant properties of a scattering transform is its stability to 
deformations Lrx{u) = x{u — t{u)), because wavelets are 
stable to deformations and the modulus commutes with 
Lr. Let ||r||oo = sup^ \t{u)\ and ||Vr||oo = sup^ |Vr(ii)| < 
l.li Sj is computed on paths of length m < mmax then 
it is proved in [|22| that for signals x of compact support 



\\Sj{Lrx) - Sjx\\ < Crrin 



h-'\\r\\ 



with a second order Hessian term which is negligible if 
t{u) is regular. If 2^ > ||r||cx)/|| Vr||oo then the translation 
term can be neglected and the transform is Lipschitz 
continuous to deformations: 



\\Sj{Lrx)-Sjx\\ < Cm^ax|k||||Vr||, 



(14) 



3.2 Scattering Stationary Processes 

Image textures can be modeled as realizations of sta- 
tionary processes X{u). We denote the expected value 
of X by E{X), which does not depend upon u. The 
Fourier spectrum RX{uj) is the Fourier transform of the 
autocorrelation 

RX{r) = e(^[X{u) - E{X)][X{u - r) - E{X)] 

Despite the importance of spectral methods, the Fourier 
spectrum is often not sufficient to discriminate image 
textures because it does not take into account higher- 
order moments. Figure |5] shows very different textures 
having same second-order moments. A scattering repre- 
sentation of stationary processes includes second order 
and higher-order moment descriptors of stationary pro- 
cesses, which discriminates between such textures. 

If X{u) is stationary then U[p]X{u) remains stationary 
because it is computed with a cascade of convolutions 
and modulus operators which preserve stationarity Its 
expected value thus does not depend upon u and defines 
the expected scattering transform: 

^X{p) = E{U[p]X) . 

A windowed scattering gives an estimator of SX{p), 
calculated from a single realization of X, by averaging 
U[p\X with (l)2J\ 

Sj[p\X{u) = U[p]X^^2A^) • 

Since j(j)2j{u)du = 1, this estimator is unbiased: 
E{Sj[p]X) = E{U[p]X). 

For appropriate wavelets, it is also proved [|22| that 

Y, E{\Sj[p]X\') = E{\X\^) . (15) 

peVj 

Replacing X by X ^ V^a implies that 

Y, mSj[p^ X]X\') = E{\X ^^x\') . 
peVj 

These expected squared wavelet coefficients can also be 
written as a filtered integration of the Fourier power 
spectrum RX{uj) 

E{\Xi.iljx\^)= I RX{uj)\ij{\-^uj)\^duj . 



iVrll 



(13) 



These two equations prove that summing scattering coef- 
ficients recovers the power spectrum integral over each 
wavelet frequency support, which only depends upon 
second-order moments. However, one can also show that 
scattering coefficients SX{p) depend upon moments of 
X up to the order 2^ if p has a length m. Scattering 
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(a) 



(b) 



(c) 



(d) 



Fig. 5. Two different textures having the same Fourier power spectrum, (a) Textures X{u). Top: Brodatz texture. 
Bottom: Gaussian process, (b) Same estimated power spectrum RX{uj). (c) Nearly same scattering coefficients 
Sj[p]X for m = 1 and 2^ equal to the image width, (d) Different scattering coefficients Sj[p]X for m = 2. 



coefficients can thus discriminate textures having same 
second-order moments but different higher-order mo- 
ments. This is illustrated using the two textures in Figure 
O which have the same power spectrum and hence same 
second order moments. Scattering coefficients Sj[p]X are 
shown for m = 1 and m = 2 with the frequency tiling 
illustrated in Figure O The ability to discriminate the top 
process Xi from the bottom process X2 is measured by 
a scattering distance normalized by the variance: 



p{m) 



\\SjX,[AJ]-E{SjX2[AJ]W 
E{\\SjX2[A^]-E{SjX2[AJ]W) 



For m = 1, scattering coefficients mostly depend upon 
second-order moments and are thus nearly equal for 
both textures. One can indeed verify numerically that 
p(l) = 1 so both textures can not be distinguished 
using first order scattering coefficients. On the contrary, 
scattering coefficients of order 2 are highly dissimilar 
because they depend on moments up to order 4, and 
p(2) = 5. 

For a large class of ergodic processes including most 
image textures, it is observed numerically that the to- 
tal scattering variance YjpeVj ^i\'^j[p\^ ~ ^-^(p)P) de- 
creases to zero when 2^ increases. Table |2] shows the de- 
cay of the total scattering variance, computed on average 
over the Brodatz texture dataset. Since E{\Sj[p]X\'^) = 
E{Sj[p]X)^^E{\Sj[p]X-E{Sj[p]X)\^) and E{Sj[p]X) = 
SX{p), it results from the energy conservation ([15| that 
the expected scattering transform also satisfies 

CX) 

\\SXf=Y. E \SX{p)\^ = E{\X\^) . 

m=OpeA^ 



TABLE 2 
Decay of th^ total scattering variance 

EpeVj^(\SAp]X-SX{p)\^)/E{\X\^)\n percentage, as 

a function of J, averaged over the Brodatz dataset. 

Results obtained using cubic spline wavelets. 



J = 1 J -- 



85 



65 



45 



26 



14 



2.5 



TABLE 3 

Percentage of expected scattering energy 

J2peA^ |6^X(p)p, as a function of the scattering order m, 

computed with cubic spline wavelets, over the Brodatz 

dataset. 
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m = 1 


m = 2 


m = 3 


m = 4 





74 


19 


3 


0.3 



Table |3] gives the percentage of expected scattering en- 
ergy XlpGA^ l*5'X(p)p carried by paths of length m, for 
textures in the Brodatz database. Most of the energy is 
concentrated in paths of length m < 3. 



3.3 Cosine Scattering Transform 

Natural images have scattering coefficients Sj[p]X{u) 
which are correlated across paths p = (2-^^ri, ...,2-^'^r^), 
at any given position u. The strongest correlation is 
between paths of same length. For each m, scattering 
coefficients are decorrelated in a Karhunen-Loeve basis 
which diagonalizes their covariance matrix. Figure [6] 
compares the decay of the sorted variances E{\Sj[p]X — 



E{Sj[p\X)\^) and the variance decay in the Karhunen- 
Loeve basis computed on paths of length m = 1, and on 
paths of length m = 2, over the Caltech image dataset 
with a Morlet wavelet. The variance decay is much faster 
in the Karhunen-Loeve basis, which shows that there is a 
strong correlation between scattering coefficients of same 
path length. 

A change of variables proves that a rotation and scal- 
ing X2ir{u) = X{2~Wu) produces a rotation and inverse 
scaling on the path variable p = (2-^^ri, ..., 2-^'^r^): 

'SX2ir{v) =SX{2^rp) where 2^rp = (2^+^'Vri, ..., 2^+^-rr^ 

If images are randomly rotated and scaled by 2V~^ then 
the path p is randomly rotated and scaled Il27l . In this 
case, the scattering transform has stationary variations 
along the scale and rotation variables. This suggests 
approximating the Karhunen-Loeve basis by a cosine 
basis along these variables. Let us parameterize each 
rotation r by its angle G [0,27r]. A path p is then 
parameterized by ([ji, 6>i], ..., [j^, ^m])- 

Since scattering coefficients are computed along fre- 
quency decreasing paths for which —J<jk < jk-ir to 
reduce boundary effects, a separable cosine transform 
is computed along the variables ji = ji , J2 = J2 - 
ji, ...,jm = jm - jm-ir and aloug each angle variable 
6>i, ^2, "',0m- We define the cosine scattering transform 
as the coefficients obtained by applying this separable 
discrete cosine transform along the scale and angle 
variables of Sj[p]X{u), for each u and each path length 
m. Figure [6] shows that the cosine scattering coefficients 
have variances for m = 1 and m = 2 which decay 
nearly as fast as the variances in the Karhunen-Loeve 
basis. It shows that a DCT across scales and orientations 
is nearly optimal to decorrelate scattering coefficients. 
Lower-frequency DCT coefficients absorb most of the 
scattering energy. On natural images, more than 99% 
of the scattering energy is absorbed by the 1/3 lowest 
frequency cosine scattering coefficients. 



3.4 Fast Scattering Computations 

Section 13.11 shows that the scattering energy is concen- 
trated along frequency-decreasing paths p = {2^^rk)k 
satisfying 2"-^ < 2^'^+^ < 2^^. If the wavelet transform 
is computed along C directions then the total number 
of frequency-decreasing paths of length m is C'^(^). 
Since (j)2J is a low-pass filter, Sj[p]x{u) = [/[p]x^^2^('^) 
can be uniformly sampled at intervals a2^, with a = 1 
or a = 1/2. If x{n) is a discrete image with N pix- 
els, then each Sj[p]x has 2~'^'^a~'^N coefficients. The 
scattering representation along all frequency-decreasing 
paths of length at most m thus has a total number 
of coefficients equal to Nj = A^of-22-2^ J^^^^ C^(^). 
This reduced scattering representation is computed by 
a cascade of convolutions, modulus, and sub-samplings, 
with O(A^logA^) operations. The final DCT transform 
further compresses the resulting representation. 



Let us recall from Section 12.31 that scattering coeffi- 
cients are computed by iteratively applying the one-step 
propagator Uj. To compute subsampled scattering coeffi- 
cients along frequency-decreasing paths, this propagator 
is truncated. For any scale 2^, Uk,j transforms a signal 
x{2^an) into 

Ukjx=\xi^(pj{2'^an), Ix^ ?/^2Jr(2-^<^^)| r 

I J -J<j<k,reG+ 

(16) 
The algorithm computes subsampled scattering coeffi- 
\ cients by iterating on this propagator. 

Algorithm 1 Reduced Scattering Transform 

Compute Uo^j{x) 
Output X ^ (j)2J (2^an) 
for m = 1 to mmax — 1 do 

for all > ji > ... > jm > -J do 

for all (ri,...,r<^) G G+^ do 
if 771 = mmax — 1 then 

Compute |||x^V^2Jirik---kV^2Jrr.r^k^2^(2^Q^n) 
else 

Compute Uj^^j{\\\x ^ V^2nri | ^ ...| ^ V^2^-r^ I) 

end if 

Output |||^^'0ji,7i| ^ ...| ^'0jq,7q| ^ (j) J {2'^ oin) 
end for 
end for 
end for 



If X is a signal of size P then FFT's compute Uk.jx 
with 0(P log P) operations. A reduced scattering trans- 
form thus computes its Nj = Na-'^2-'^'^ Y2Z^q ^"^(m) 
coefficients with O(A^jlogA^) operations. If mmax = 2 
then Nj = Na-^2-^\CJ + C^J{J - l)/2). It decreases 
exponentially when the scale 2'^ increases. 

Scattering coefficients are decorrelated with a sepa- 
rable DCT along each scale variable ji = ji , J2 = 
J2 - ji, ...,jm = jm - jm-1 and each rotation angle 
variable 6>i, 6>2, ••••,Omf which also requires 0{Nj log N) 
operations. For natural images, more than 99.5 % of the 
total signal energy is carried by the resulting Nj/2 cosine 
scattering coefficients of lower frequencies. 

Numerical computations in this paper are performed 
by rotating wavelets along C = 6 directions, for scat- 
tering representations of maximum order mmax = 2. 
The resulting size of a reduced cosine scattering repre- 
sentation has at most three times as many coefficients 
as a dense SIFT representation. SIFT represents small 
blocks of 4^ pixels with 8 coefficients. A cosine scattering 
representation represents each image block of 2^^ pixels 
by Nj2^J/{2N) = {CJ + C^J{J - l)/2)/2 coefficients, 
which is equal to 24 for C = 6 and J = 2. The cosine 
scattering transform is thus three times the size of SIFT 
for J = 2, but as J increases, the relative size decreases. 
If J = 3 then the size of a cosine scattering representation 
is twice the size of a SIFT representation but for J = 7 
it is about 20 times smaller. 
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Fig. 6. (A): Sorted variances of scattering coefficients for m = 1 (left) and m = 2 (right). (B): Sorted variances of DCT 
scattering coefficients. (C): Variances in the scattering Karhunen-Loeve basis. 



4 Classification Using Scattering Vec- 
tors 

A scattering transform eliminates the image variability 
due to translation and is stable to deformations. The 
resulting classification properties are studied with a PCA 
and an SVM classifier applied to scattering representa- 
tions computed with a Morlet wavelet. State-of-the-art 
results are obtained for hand-written digit recognition 
and for texture discrimination. 



4.1 PCA Affine Scattering Space Selection 

Although discriminant classifiers such as SVM have 
better asymptotic properties than generative classifiers 
Il26l , the situation can be inverted for small training 
sets. We introduce a simple robust generative classifier 
based on affine space models computed with a PCA. 
Applying a DCT on scattering coefficients has no effect 
on any linear classifier because it is a linear orthogonal 
transform. However, keeping the 50% lower frequency 
cosine scattering coefficients reduces computations and 
has a negligible effect on classification results. The clas- 
sification algorithm is described directly on scattering 
coefficients to simplify explanations. Each signal class is 
represented by a random vector X/e, whose realizations 
are images of N pixels in the class. 

Let E{SjX) = {E{Sj[p]X{u))}p^u be the family 
of Nj expected scattering values, computed along all 
frequency-decreasing paths of length m < mmax and 
all subsampled positions u = a2^n. The difference 
SjXk — E{SjXk) is approximated by its projection in 
a linear space of low dimension d <C Nj. The covariance 
matrix of SjXk is a matrix of size N'j. Let \d,k be the 
linear space generated by the d PCA eigenvectors of 
this covariance matrix having the largest eigenvalues. 
Among all linear spaces of dimension d, this is the 
space which approximates SjXk — E{SjXk) with the 
smallest expected quadratic error. This is equivalent 
to approximating SjXk by its projection on an affine 
approximation space: 

^d,k = E{SjXk} + ^d,k' 



The resulting classifier associates a signal X to the 
class k which yields the best approximation space: 



A:(X) = argmin||5jX 

k<K 



Pa,ASjX)\\ 



(17) 



The minimization of this distance has similarities with 
the minimization of a tangential distance [12J in the 
sense that we remove the principal scattering directions 
of variabilities to evaluate the distance. However it is 
much simpler since it does not evaluate a tangential 
space which depends upon Sjx. Let V^^ be the orthog- 
onal complement of 'Vd.k corresponding to directions 
of lower variability. This distance is also equal to the 
norm of the difference between Sjx and the average 
class "template'' E{SjXk), projected in Vj-^: 

\\Sjx-Pa,,ASjx)\\ = \\p^^^^(^Sjx-E{SjXk))\\ . (18) 

Minimizing the affine space approximation error is thus 
equivalent to finding the class centroid E{SjXk) which 
is the closest to Sjx, without taking into account the 
first d principal variability directions. The d principal 
directions of the space \d,k result from deformations and 
from structural variability. The projection Px^^{Sjx) 
is the optimum linear prediction of Sjx from these 
d principal modes. The selected class has the smallest 
prediction error. 

This affine space selection is effective if SjXk — 
E{SjXk) is well approximated by a projection in a low- 
dimensional space. This is the case if realizations of Xk 
are translations and limited deformations of a single tem- 
plate. Indeed, the Lipschitz continuity condition implies 
that small deformations are linearized by the scattering 
transform. Hand-written digit recognition is an example. 
This is also valid for stationary textures where SjXj^ has 
a small variance, which can be interpreted as structural 
variability. 

The dimension d must be adjusted so that SjXj^ has 
a better approximation in the affine space A^ ^ than in 
affine spaces A^^k' of other classes k' ^ k. This is a 
model selection problem, which requires to optimize the 
dimension d in order to avoid over-fitting [5J. 

The invariance scale 2'^ must also be optimized. When 
the scale 2'^ increases, translation invariance increases 
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but it comes with a partial loss of information which 
brings the representations of different signals closer. One 
can prove [22J that for any x and x' 

WSj+^x - Sj+^x'W <\\Sjx - Sjx'W . 

When 2'^ goes to infinity, this scattering distance con- 
verges to a non-zero value. To classify deformed tem- 
plates such as hand-written digits, the optimal 2^ is of 
the order of the maximum pixel displacements due to 
translations and deformations. In a stochastic framework 
where x and x' are realizations of stationary processes, 
Sjx and Sjx' converge to the expected scattering trans- 
forms Sx and Sx' . In order to classify stationary pro- 
cesses such as textures, the optimal scale is the maximum 
scale equal to the image width, because it minimizes the 
variance of the windowed scattering estimator. 

A cross-validation procedure is used to find the di- 
mension d and the scale 2^ which yield the smallest 
classification error. This error is computed on a subset 
of the training images, which is not used to estimate the 
covariance matrix for the PC A calculations. 

As in the case of SVM, the performance of the affine 
PCA classifier can be improved by equalizing the de- 
scriptor space. Table [l] shows that scattering vectors have 
unequal energy distribution along its path variables, in 
particular as the order varies. A robust equalization 
is obtained by re-normalizing each Sj[p\X{u) by the 

maximum \\Sj[p]Xi\\ = (En I^^W^^HI")'^' over all 
training signals Xi\ 



Sj\p\X{u) 



(19) 



To simplify notations, we still write SjX for this nor- 
malized scattering vector. 

Affine space scattering models can be interpreted as 
generative models computed independently for each 
class. As opposed to discriminative classifiers such as 
SVM, they do not estimate cross-terms between classes, 
besides from the choice of the model dimensionality 
d. Such estimators are particularly effective for small 
number of training samples per class. Indeed, if there 
are few training samples per class then variance terms 
dominate bias errors when estimating off-diagonal co- 
variance coefficients between classes [4J. 

An affine space approximation classifier can also be 
interpreted as a robust quadratic discriminant classifier 
obtained by coarsely quantizing the eigenvalues of the 
inverse covariance matrix. For each class, the eigenval- 
ues of the inverse covariance are set io {) \n\d,k and to 
1 in Vj-^, where d is adjusted by cross-validation. This 
coarse quantization is justified by the poor estimation 
of covariance eigenvalues from few training samples. 
These affine space models will typically be applied to 
distributions of scattering vectors having non-Gaussian 
distributions, where a Gaussian Fisher discriminant can 
lead to important errors. 



4.2 Handwritten Digit Recognition 

The MNIST database of hand-written digits is an ex- 
ample of structured pattern classification, where most 
of the intra-class variability is due to local translations 
and deformations. It comprises at most 60000 training 
samples and 10000 test samples. If the training dataset 
is not augmented with deformations, the state of the art 
was achieved by deep-learning convolutional networks 
[29J, deformation models [15J, and dictionary learning 
[25J. These results are improved by a scattering classifier. 

All computations are performed on the reduced cosine 
scattering representation described in Section 13.31 which 
keeps the lower-frequency half of the coefficients. Table 
m computes classification errors on a fixed set of test 
images, depending upon the size of the training set, 
for different representations and classifiers. The affine 
space selection of section 14.11 is compared with an SVM 
classifier using RBF kernels, which are computed us- 
ing Libsvm [9J, and whose variance is adjusted using 
standard cross-validation over a subset of the training 
set. The SVM classifier is trained with a renormalization 
which maps all coefficients to [—1,1]. The PCA classifier 
is trained with the renormalisation (p^ . The first two 
columns of Table |4] show that classification errors are 
much smaller with an SVM than with the PCA algo- 
rithm if applied directly on the image. The 3rd and 4th 
columns give the classification error obtained with a 
PCA or an SVM classification applied to the modulus 
of a windowed Fourier transform. The spatial size 2^ of 
the window is optimized with a cross-validation which 
yields a minimum error for 2^ = 8. It corresponds 
to the largest pixel displacements due to translations 
or deformations in each class. Removing the complex 
phase of the windowed Fourier transform yields a locally 
invariant representation but whose high frequencies are 
unstable to deformations, as explained in Section 12.11 
Suppressing this local translation variability improves 
the classification rate by a factor 3 for a PCA and by 
almost 2 for an SVM. The comparison between PCA 
and SVM confirms the fact that generative classifiers 
can outperform discriminative classifiers when training 
samples are scarce [26]. As the training set size increases, 
the bias-variance trade-off turns in favor of the richer 
SVM classifiers, independently of the descriptor. 

Columns 6 and 8 give the PCA classification result 
applied to a windowed scattering representation for 
^max = 1 and mmax = 2. The cross validation also 
chooses 2^ = 8. For the digit '3', Figure [71 displays 
the 4-by-4 array of normalized scattering vectors. For 
each u = 2^(ni,n2) with 1 < n^ < 4, the scattering 
vector Sj[p]X{u) is displayed for paths of length m = 1 
and m = 2, as circular frequency energy distributions 
following Section 12.31 

Increasing the scattering order from mmax = 1 to 
^max = 2 reduces errors by about 30%, which shows that 
second order coefficients carry important information 
even at a relatively small scale 2^ = 8. However, third 
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Fig. 7. (a): Image X(2x) of a digit '3'. (b): Array of scattering vectors Sj[p\X{u), for m 
2^ = 8. (c): Scattering vectors Sj[p]X{u), for m = 2. 



(c) 

1 and ix sampled at intervals 



order coefficients have a negligible energy and including 
them brings marginal classification improvements, while 
increasing computations by an important factor. As the 
learning set increases in size, the classification improve- 
ment of a scattering transform increases relatively to 
windowed Fourier transform because the classification 
is able to incorporate more high frequency structures, 
which have deformation instabilities in the Fourier do- 
main as opposed to the scattering domain. 

Table m also shows that below 5-10^ training samples, 
the scattering PCA classifier improves results of a deep- 
learning convolutional networks, which learns all filter 
coefficients with a back-propagation algorithm [18J. As 
more training samples are available, the flexibility of the 
SVM classifier brings an improvement over the more 
rigid affine classifier, yielding a 0.43% error rate on the 
original dataset, thus improving upon previous state of 
the art methods. 

To evaluate the precision of the affine space model, 
we compute the relative affine approximation error, av- 
eraged over all classes: 



K 



4 



^-^E 



k=l 



E{\\SjXk-PA,,,{SjXkm 
E{\\SjXkP) 



For any SjXk, we also calculate the minimum approx- 
imation error produced by another affine model A^^k' 
with k' ^ k: 



X,.= 



Ejmmk'^k \\SjXk - PA^,JSjXk)f) 
E{\\SjXk-PA,JSjXkW) 



For a scattering representation with mmax = 2, Table 
|5] gives the dimension d of affine approximation spaces 
optimized with a cross validation, with the correspond- 
ing values of a^ and A^^. When the training set size 
increases, the model dimension d increases because there 
are more samples to estimate each intra-class covariance 
matrix. The approximation model becomes more precise 
so <j^ decreases and the relative approximation error A^^ 
produced by wrong classes increases. This explains the 
reduction of the classification error rate observed in Table 
IHas the training size increases. 



TABLE 5 
Values of the dimension d of affine approximation models 

on MNIST classification, of the intra class normalized 

approximation error aj, and of the ratio A^ between inter 

class and intra class approximation errors, as a function 

of the training size. 



Training 


d 


-^ 


Ad 


300 
5000 
40000 


5 
100 
140 


3-10-^ 
4-10-2 
2-10-2 


2 
3 

4 



TABLE 6 
Percentage of errors for the whole USPS database. 



Tang. 
Kern. 


Scat, mmax = 2 
SVM 


Scat, mmax = 1 
PCA 


Scat, mmax = 2 
PCA 


2.4 


2.7 


3.24 


2.6 / 2.3 



The US-Postal Service is another handwritten digit 
dataset, with 7291 training samples and 2007 test images 
16x16 pixels. The state of the art is obtained with tangent 
distance kernels [[T2|. Table [6] gives results obtained 
with a scattering transform with the PCA classifier for 
^max = 1,2. The cross-validation sets the scattering scale 
to 2^ = 8. As in the MNIST case, the error is reduced 
when going from mmax = 1 to mmax = 2 but remains 
stable for mmax = 3. Different renormalization strategies 
can bring marginal improvements on this dataset. If the 
renormalization is performed by equalizing using the 
standard deviation of each component, the classification 
error is 2.3% whereas it is 2.6% if the supremum is 
normalized. 

The scattering transform is stable but not invariant 
to rotations. Stability to rotations is demonstrated over 
the MNIST database in the setting defined in [16J. A 
database with 12000 training samples and 50000 test 
images is constructed with random rotations of MNIST 
digits. The PCA affine space selection takes into account 
the rotation variability by increasing the dimension d 
of the affine approximation space. This is equivalent 
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TABLE 4 
MNIST classification results. 



Training 




X 


Wind. 


Four. 


Scat. 


mmax = 1 


Scat. 


mmax = 2 


Conv. 


size 


PCA 


SVM 


PCA 


SVM 


PCA 


SVM 


PCA 


SVM 


Net. 


300 


14.5 


15.4 


7.35 


7.4 


5.7 


8 


4.7 


5.6 


7.18 


1000 


7.2 


8.2 


3.74 


3.74 


2.35 


4 


2.3 


2.6 


3.21 


2000 


5.8 


6.5 


2.99 


2.9 


1.7 


2.6 


1.3 


1.8 


2.53 


5000 


4.9 


4 


2.34 


2.2 


1.6 


1.6 


1.03 


1.4 


1.52 


10000 


4.55 


3.11 


2.24 


1.65 


1.5 


1.23 


0.88 


1 


0.85 


20000 


4.25 


2.2 


1.92 


1.15 


1.4 


0.96 


0.79 


0.58 


0.76 


40000 


4.1 


1.7 


1.85 


0.9 


1.36 


0.75 


0.74 


0.53 


0.65 


60000 


4.3 


1.4 


1.80 


0.8 


1.34 


0.62 


0.7 


0.43 


0.53 



TABLE 7 
Percentage of errors on an MNIST rotated dataset ©J. 



Scat, mmax = 1 
PCA 


Scat, mmax = 2 
PCA 


Conv. 

Net. 


8 


4.4 


8.8 



TABLE 8 

Percentage of errors on scaled and/or rotated MNIST 

digits 



Transformed 


Scat, mmax = 1 


beat. TTZmax 


= 2 


Images 


PCA 


PCA 




Without 


1.6 


0.8 




Rotation 


6.7 


3.3 




Scaling 


2 


1 




Rot. + Seal. 


12 


5.5 





to projecting the distance to the class centroid on a 
smaller orthogonal space, by removing more principal 
components. The error rate in Table [Zl is much smaller 
with a scattering PCA than with a convolution network 
116)1 . Much better results are obtained for a scattering 
with mmax = 2 than with mmax = 1 because second order 
coefficients maintain enough discriminability despite the 
removal of a larger number d of principal directions. In 
this case, mmax = 3 marginally reduces the error. 

Scaling invariance is studied by introducing a random 
scaling factor uniformly distributed between l/>/2 and 
a/2. In this case, the digit '9' is removed from the 
database as to avoid any indetermination with the digit 
'6' when rotated. The training set has 9000 samples (1000 
samples per class). Table [8] gives the error rate on the 
original MNIST database and including either rotation, 
scaling, or both in the training and testing samples. 
Scaling has a smaller impact on the error rate than 
rotating digits because scaled scattering vectors span 
an invariant linear space of lower dimension. Second- 
order scattering outperforms first-order scattering, and 
the difference becomes more significant when rotation 
and scaling are combined, because it provides interaction 
coefficients which are discriminative even in presence of 
scaling and rotation variability. 



4.3 Texture Discrimination 

Visual texture discrimination remains an outstanding im- 
age processing problem because textures are realizations 
of non-Gaussian stationary processes, which cannot be 
discriminated using the power spectrum. Depending on 
the imaging conditions, textures undergo transforma- 
tions due to illumination, rotation, scaling or more com- 
plex deformations when mapped on three-dimensional 
surfaces. The affine PCA space classifier removes most of 
the variability of SjX — E{SjX} within each class. This 
variability is due to the residual stochastic variability 
which decays as J increases and to variability due to 
illumination, rotation and perspective effects. 

Texture classification is tested on the CUReT texture 
database (TH, ||35||, which includes 61 classes of image 
textures of A^ = 200^ pixels. Each texture class gives 
images of the same material with different pose and 
illumination conditions. Specularities, shadowing and 
surface normal variations make classification challeng- 
ing. Pose variation requires global rotation and illumi- 
nation invariance. Figure [8] illustrates the large intra- 
class variability, after a normalization of the mean and 
variance of each textured image. 

Table [9] compares error rates obtained with different 
classifiers. The database is randomly split into a training 
and a testing set, with 46 training images for each class 
as in [35J. Results are averaged over 10 different splits. A 
PCA affine space classifier applied directly on the image 
yields a large classification error of 17%. To estimate 
the Fourier spectrum, windowed Fourier transforms are 
computed over half-overlapping windows of size 2^, 
and their squared modulus is averaged over the whole 
image. This averaging is necessary to reduce the spec- 
trum estimator variance, which does not decrease when 
the window size 2^^ increases. The cross-validation sets 
the optimal window scale to 2^ = 32, whereas images 
have a width of 200 pixels. The error drops to 1%. This 
simple Fourier spectrum yields a smaller error than pre- 
viously reported state-of-the-art methods. SVM's applied 
to a dictionary of textons yield an error rate of 1.53% 
[13J, whereas an optimized Markov Random Field model 
computed with image patches ||35| achieves an error of 
2.46%. 

For the scattering PCA classifier, the cross validation 
chooses an optimal scale 2^^ equal to the image width 
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Fig. 8. Examples of textures from the CUReT database with normalized mean and variance. Each row corresponds to 
a different class, showing intra-class variability in the form of stochastic variability and changes in pose and illumination. 

TABLE 9 
Percentage of errors on CUReT for different training sizes. 



Training 
size 


X 
PCA 


Four. Spectr. 
PCA 


Scat, mmax = 1 
PCA 


Scat, mmax = 2 
PCA 


Textons 
SVM |13| 


MRF 

|35| 


46 


17 


1 


0.5 


0.2 


1.53 


2.4 






(a) 



(b) 



(c) 



Fig. 9. (a): Example of CureT texture X{u). (b): Scattering coefficients Sj[p]X, for m = 1 and 2'^ equal to the image 
width, (c): Scattering coefficients Sj[p\X{u), for m = 2. 



to reduce the scattering estimation variance. Indeed, 
contrarly to a power spectrum estimation, the variance of 
the scattering vector decreases when 2'^ increases. Figure 
|9] displays the scattering coefficients Sj[p\X of order 
m = 1 and m = 2 of a CureT textured image X. When 
^max = 1/ the error drops to 0.5%, although first-order 
scattering coefficients essentially depend upon second 
order moments as the Fourier spectrum. This is probably 
due to the fact that image textures have a spectrum 
which typically decays like |cj|~". For such spectrum, 
an estimation over dyadic frequency bands provide a 
better bias versus variance trade-off than a windowed 
Fourier spectrum [IJ. For rumax = 2, the error further 
drops to 0.2%. Indeed, scattering coefficients of order 
m = 2 depend upon moments of order 4, which are 
necessary to differentiate textures having same second 
order moments as in Figure O The dimension of the 
affine approximation space model is (i = 16, the intra- 



class normalized approximation error is a^ = 2.5 • 10 
and the separation ratio is A^^ = 3 for mmax = 2. 



The PCA classifier provides a partial rotation invari- 
ance by removing principal components. It averages 
scattering coefficients along path rotation parameters, 
which comes with a loss of discriminability However, 
a more efficient rotation invariant texture classification 
is obtained by cascading this translation invariant scat- 
tering with a second rotation invariant scattering [24). 
It transforms each layer of the translation invariant 
scattering network with new wavelet convolutions along 
rotation parameters, followed by modulus and average 
pooling operators, which are cascaded. A combined 
translation and rotation scattering yields a translation 
and rotation invariant representation which is stable to 
deformations [|22|. 
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5 Conclusion 

A wavelet scattering transform computes a translation 
invariant representation, which is stable to deforma- 
tion, using a deep convolution network architecture. 
The first layer outputs SIFT-type descriptors, which are 
not sufficiently informative for large-scale invariance. 
Classification performance is improved by adding other 
layers providing complementary information. A reduced 
cosine scattering transform is at most three times larger 
than a SIFT descriptor and computed with O(A^logA^) 
operations. 

State-of-the-art classification results are obtained for 
handwritten digit recognition and texture discrimina- 
tion, with an SVM or a PCA classifier. If the data 
set has other sources of variability due to the action 
of other finite Lie groups such as rotations, then this 
variability can be eliminated with an invariant scattering 
computed by cascading wavelet transforms defined on 
these groups |l22l, (Zi). 

However, signal classes may also include complex 
sources of variability that can not be approximated by 
the action of a finite group, as in CalTechlOl or Pascal 
databases. This variability must be taken into account by 
unsupervised optimizations of the representations from 
the training data. Deep convolution networks which 
learn filters from the data iTTSl have the flexibility to 
adapt to such variability, but learning translation in- 
variant filters is not necessary. A wavelet scattering 
transform can be used on the first two network layers, 
while learning the next layer filters applied to scatter- 
ing coefficients. Similarly, bag-of -features unsupervised 
algorithms Il37l , IITI applied to SIFT can potentially be 
improved upon by replacing SIFT descriptors by wavelet 
scattering vectors. 



References 

[1] p. Abry, P. Gongalves, and P. Flandrin, "Wavelets, spectrum anal- 
ysis and 1/f processes". Wavelets and statistics. Lecture Notes in 
Statistics, 1995. 

[2] S. Allassonniere, Y. Amit, A. Trouve, "Toward a coherent statistical 
framework for dense deformable template estimation". Volume 
69, part 1 (2007), pages 3-29, of the Journal of the Royal Statistical 
Society. 

[3] R. Bajcsy and S. Kovacic, "Multi-resolution elastic matching". 
Computer Vision Graphics and Image Processing, vol 46, Issue 
1, April 1989. 

[4] P. J. Bickel and E. Levina: "Covariance regularization by thresh- 
olding". Annals of Statistics, 2008. 

[5] L. Birge and P. Massart. "From model selection to adaptive 
estimation." In Festschrift for Lucien Le Cam: Research Papers 
in Probability and Statistics, 55 - 88, Springer- Verlag, New York, 
1997. 

[6] J. Bruna, "Operators commuting with diffeomorphisms", CMAP 
Tech. Report, Ecole Polytechnique, 2012. 

[7] Y-L. Boureau, F. Bach, Y LeCun, and J. Ponce. "Learning Mid- 
Level Features For Recognition". In IEEE Conference on Com- 
puter Vision and Pattern Recognition, 2010. 

[8] J. Bouvrie, L. Rosasco, T. Poggio: "On Invariance in Hierarchical 
Models", NIPS 2009. 

[9] C. Chang and C. Lin, "LIBSVM : a library for support vector 
machines". ACM Transactions on Intelligent Systems and Tech- 
nology, 2:27:1-27:27, 2011 



[lO: 

[11 

[12: 
[13: 

[14: 

[15 

[16: 
[17: 

[18: 

[19 

[2o: 

[21 

[22: 

[23: 
[24: 

[25 

[26: 

[27: 
[28: 

[29 

[3o: 

[31 
[32 

[33: 
[34: 

[35 

[36: 
[37: 



L. Fei-Fei, R. Fergus and P. Perona. "Learning generative visual 
models from few training examples: an incremental Bayesian 
approach tested on 101 object categories". IEEE. CVPR 2004 
Z. Guo, L. Zhang, D. Zhang, "Rotation Invariant texture classifi- 
cation using LBP variance (LBPV) with global matching", Elsevier 
Journal of Pattern Recognition, Aug. 2009. 

B.Haasdonk, D.Keysers: "Tangent Distance kernels for support 
vector machines", 2002. 

E. Hayman, B. Caputo, M. Fritz and J.O. Eklundh, "On the 
Significance of Real- World Conditions for Material Classification", 
ECCV, 2004. 

K. Jarrett, K. Kavukcuoglu, M. Ranzato and Y LeCun: "What is 
the Best Multi-Stage Architecture for Object Recognition?", Proc. 
of ICCV 2009. 

D.Keysers, T.Deselaers, C.GoUan, H.Ney, "Deformation Models 
for image recognition", IEEE trans of PAMI, 2007. 
H. Larochelle, Y Bengio, J. Louradour, P. Lamblin, "Exploring 
Strategies for Training Deep Neural Networks", Journal of Ma- 
chine Learning Research, Jan. 2009. 

S. Lazebnik, C. Schmid, J.Ponce. "Beyond Bags of Features: Spatial 
Pyramid Matching for Recognizing Natural Scene Categories". 
Proceedings of the IEEE Conference on Computer Vision and 
Pattern Recognition, New York, June 2006, vol. II, pp. 2169-2178. 
Y LeCun, K. Kavukvuoglu and C. Farabet: "Convolutional Net- 
works and Applications in Vision", Proc. of ISCAS 2010. 
T. Leung and J. Malik; "Representing and Recognizing the Visual 
Appearance of Materials Using Three-Dimensional Textons". In- 
ternational Journal of Computer Vision, 43(1), 29-44; 2001. 
W. Lohmiller and J.J.E. Slotine "On Contraction Analysis for 
Nonlinear Systems", Automatica, 34(6), 1998. 
D.G. Lowe, "Distinctive Image Features from Scale-Invariant Key- 
points", International Journal of Computer Vision, 60, 2, pp. 91- 
110, 2004 

S. Mallat "Group Invariant Scattering", to appear in 
"Communications in Pure and Applied Mathematics", 2012, 
http://arxiv.org/abs/1101.2286. 

S. Mallat, "Recursive Interferometric Representation", Proc. of 
EUSICO conference, Denmark, August 2010. 
S.Mallat , L. Sifre : "Combined scattering for rotation invariant 
texture analysis", submitted to ESANN, 2012. 
J. Mairal, F. Bach, J.Ponce, "Task-Driven Dictionary Learning", 
Submitted to IEEE trans, on PAMI, September 2010. 

A. Y Ng and M. I. Jordan "On discriminative vs. generative 
classifiers: A comparison of logistic regression and naive Bayes", 
in Advances in Neural Information Processing Systems (NIPS) 14, 
2002. 

L. Perrinet, "Role of Homeostasis in Learning Sparse Representa- 
tions", Neural Computation Journal, 2010. 

J.Portilla, E.Simoncelli, "A Parametric Texture model based on 
joint statistics of complex wavelet coefficients", IJCV, 2000. 
M. Ranzato, F.Huang, YBoreau, Y LeCun: "Unsupervised Learn- 
ing of Invariant Feature Hierarchies with Applications to Object 
Recognition", CVPR 2007. 

C. Sagiv, N. A. Sochen and Y Y Zeevi, "Gabor Feature Space 
Diffusion via the Minimal Weighted Area Method", Springer 
Lecture Notes in Computer Science, Vol. 2134, pp. 621-635, 2001. 

B. Scholkopf and A. J. Smola, "Learning with Kernels", MIT Press, 
2002. 

S.Soatto, "Actionable Information in Vision", ICCV, 2009. 
E. Tola, VLepetit, P. Fua, "DAISY: An Efficient Dense Descriptor 
Applied to Wide-Baseline Stereo", IEEE trans on PAMI, May 2010. 
A. Trouve, L. Younes, "Local Geometry of Deformable Tem- 
plates", SI AM Journal on Mathematical Analysis. 2005. Volume: 
37, Issue: 1. 

M.Varma, A. Zisserman: "A Statistical Approach To Material 
Classification Using Image Patch Exemplars". IEEE Trans, on 
PAMI, 31(ll):2032-2047, November 2009. 

I. Waldspurger, S. Mallat "Recovering the phase of a complex 
wavelet transform", CMAP Tech. Report, Ecole Polytechnique, 
2012. 

J.Wang, J.Yang, K.Yu, F.Lv, T.Huang, YGong, "Locality- 
constrained Linear Coding for Image Classification", CVPR 
2010. 



